Efficient Hierarchical Clustering of Large Data Sets Using P-trees
Abstract
Hierarchical clustering methods have attracted much attention because they give the user a maximum amount of flexibility. Rather than requiring parameter choices to be predetermined, the result represents all possible levels of granularity. In this paper, a hierarchical method is introduced that is fundamentally related to partitioning methods, such as k-medoids and k-means, as well as to a density-based method, namely center-defined DENCLUE. It is superior to both k-means and k-medoids in its reduction of outlier influence. Nevertheless, it avoids both the time complexity of some partition-based algorithms and the storage requirements of density-based ones. An implementation is presented that is particularly suited to spatial, stream, and multimedia data, using P-trees for efficient data storage and access.
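To illustrate the claim about outlier influence, the following is a minimal one-dimensional sketch, not the paper's P-tree implementation. It assumes a Gaussian kernel and uses a mean-shift-style fixed-point update to climb to a local density maximum, in the spirit of a center-defined density-based cluster center. The function names, the sample data, and the bandwidth `sigma` are all hypothetical choices made for illustration.

```python
import math

def gaussian_weights(x, points, sigma):
    """Gaussian kernel weight of every data point as seen from position x."""
    return [math.exp(-((x - p) ** 2) / (2 * sigma ** 2)) for p in points]

def density_attractor(start, points, sigma, tol=1e-9, max_iter=1000):
    """Climb to a local maximum of the kernel density estimate.

    Uses the mean-shift fixed-point update: the next position is the
    kernel-weighted mean of the data, evaluated at the current position.
    """
    x = start
    for _ in range(max_iter):
        weights = gaussian_weights(x, points, sigma)
        new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical data: five points near 1.0 and a single far outlier.
points = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]

mean = sum(points) / len(points)            # a k-means style center
attractor = density_attractor(1.0, points, sigma=0.5)

print("arithmetic mean:  ", mean)
print("density attractor:", attractor)
```

In this sketch the arithmetic mean is pulled far toward the outlier, while the density attractor stays with the dense group, because the outlier's kernel weight is negligible at that location.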
Related resources
Dependent nonparametric trees for dynamic hierarchical clustering
Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections ...
Clustering Large Data Sets with Mixed Numeric and Categorical Values
Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we pres...
A Hierarchical Approach for Clusters in Different Densities
Clustering has the following challenges: 1) clusters with arbitrary shapes; 2) minimal domain knowledge to determine the input parameters; 3) scalability for large data sets. Density-based clustering has been recognized as a powerful approach for discovering clusters with arbitrary shapes. However, the other two challenges still remain in most existing clustering algorithms. In this paper, we e...
Density Modeling and Clustering Using Dirichlet Diffusion Trees
I introduce a family of prior distributions over multivariate distributions, based on the use of a “Dirichlet diffusion tree” to generate exchangeable data sets. These priors can be viewed as generalizations of Dirichlet processes and of Dirichlet process mixtures, but unlike simple mixtures, they can capture the hierarchical structure present in many distributions, by means of the latent diffu...
A natural framework for sparse hierarchical clustering
There has been a surge in the number of large and flat data sets – data sets containing a large number of features and a relatively small number of observations – due to the growing ability to collect and store information in medical research and other fields. Hierarchical clustering is a widely used clustering tool. In hierarchical clustering, large and flat data sets may allow for a better co...
Publication date: 2002